The active field of Functional Data Analysis (about understanding
the variation in a set of curves) has recently been extended to
Object Oriented Data Analysis, which considers populations of more 
general objects. A particularly challenging extension of this set of 
ideas is to populations of tree-structured objects. We develop an ana- 
log of Principal Component Analysis for trees, based on the notion 
of tree-lines, and propose numerically fast (linear time) algorithms to 
solve the resulting optimization problems. The solutions we obtain 
are used in the analysis of a data set of 73 individuals, where each 
data object is a tree of blood vessels in one person's brain. 

1. Introduction. Functional data analysis has been a recent active re- 
search area. See Ramsay and Silverman (2002, 2005) for a good introduction 
and overview, and Ferraty and Vieu (2006) for a more recent viewpoint. A 
major difference between this approach, and more classical statistical meth- 
ods is that curves are viewed as the atoms of the analysis, i.e. the goal is 
the statistical analysis of a population of curves. 

Wang and Marron (2007) recently extended functional data analysis to 
Object Oriented Data Analysis (OODA), where the atoms of the analysis are 
allowed to be more general data objects. Examples studied there include im- 
ages, shapes and tree structures as the atoms, i.e. the basic data elements of 
the population of interest. Other recent examples are populations of movies,
such as those that arise in functional magnetic resonance imaging. A
major contribution of Wang and Marron (2007) was the development of a set
of tree-population analogs of standard functional data analysis techniques, 
such as Principal Component Analysis (PCA). The foundations were laid 
via the formulation of particular optimization problems, whose solution
resulted in that analysis method (in the same spirit in which ordinary PCA
can be formulated in terms of optimization problems). 

Here the focus is on the challenging OODA case of tree structured data 
objects. A limitation of the work of Wang and Marron (2007) was that no 
general solutions appeared to be available for the optimization problems 
that were developed. Hence, only limited toy examples (three and four node 
trees, which thus allowed manual solutions) were used to illustrate the main 
ideas (although one interesting real data lesson was discovered even with 
that strong limitation on tree size). 

One of our main contributions is that, through a detailed analysis of the 
underlying optimization problem, and a complete solution of it, a linear time 
computational method is now available. This allows the first actual OODA 
of a production scale data set of a population of tree structured objects. 
Ideas are illustrated in Section 2 using a set of blood vessel trees in the 
human brain, collected as described in Aylward and Bullitt (2002). In the 
present paper, we choose to consider only variation in the topology of the 
trees, i.e. we consider only the branching structure and ignore other aspects 
of the data, such as location, thickness and curvature of each branch. 

Even with this topology only restriction, there is still an important cor- 
respondence decision that needs to be made: which branch should be put on 
the left, and which one on the right, see Section 2.1. Later analysis will also 
include location, orientation and thickness information, by adding attributes 
to the tree nodes being studied. A useful set of ideas for pursuing that type 
of analysis was developed by Wang and Marron (2007). 

In Subsection 2.2 we define our main data analytic concept, the tree-line,
and the notion of principal components based on tree-lines. Here we
also state and illustrate our main result, Theorem 2.1, which will allow
us to quickly compute the principal components. Subsection 2.3 is devoted
to our data analysis using the blood vessel data: we carefully compare the 
correspondence approaches, and present our findings based on the computed 
principal components. In Section 3 we prove Theorem 2.1 along with a host 
of necessary claims. 

2. Data and Analysis. The data analyzed here are from a study of 
Magnetic Resonance Angiography brain images of a set of 73 human subjects 
of both sexes, ranging in age from 18 to 72, which can be found at Handle 
(2008). One slice of one such image is shown in Figure 1. This mode of 
imaging indicates strong blood flow as white. These white regions are tracked 
in 3 dimensions, then combined, to give trees of brain arteries. 

Fig 1. Single slice from a Magnetic Resonance Angiography image for one patient. Bright
regions indicate blood flow.

The set of trees developed from the image of which Figure 1 is one slice is
shown in Figure 2. Trees are colored according to region of the brain. Each
region is studied separately, where each tree is one data point in the data set 
of its region. The goal of the present OODA is to understand the population 
structure of 73 subjects through 3 data sets extracted from them: Back data 
set (gold trees), left data set (cyan) and right data set (blue). One point to 
note is that the front trees (red) are not studied here. This is because the 
source of flow for the front trees is variable, therefore this subpopulation has 
less biological meaning. For simplicity we chose to omit this sub-population. 

The stored information for each of these trees is quite rich (enabling the 
detailed view shown in Figure 2). Each colored tree consists of a set of 
branch segments. Each branch segment consists of a sequence of spheres fit 
to the white regions in the MRA image (of which Figure 1 was one slice), 
as described in Aylward and Bullitt (2002). Each sphere has a center (with 
x, y, z coordinates, indicating location of a point on the center line of the
artery), and a radius (indicating arterial thickness). 

Fig 2. Reconstructed set of trees of brain arteries for the same patient as shown in
Figure 1. The colors indicate regions of the brain: Back (gold), Right (blue), Front (red),
Left (cyan).

2.1. Tree Correspondence. Given a single tree, for example the gold colored
(back) tree in Figure 2, we reduce it to only its topological (connectivity)
aspects by representing it as a simple binary tree. Figure 3 is an
example of such a representation. Each node in Figure 3 is best thought of
as a branch of the tree, and the green line segments simply show which child
branch connects to which parent. The root node at the top represents the 
initial fat gold tree trunk shown near the bottom of Figure 2. The thin blue 
lines show the support tree, which is just the union of all of the back trees, 
over the whole data set of 73 patients. 

There is one set of ambiguities in the construction of the binary tree 
shown in Figure 3. That is the choice, made for each adult branch, of which 
child branch is put on the left, and which is put on the right. The following 
two ways of resolving this ambiguity are considered here. Using standard 
terminology from image analysis, we use the word correspondence to refer 
to this choice. 

• Thickness Correspondence: Put the node that corresponds to the 
child with larger median radius (of the sequence of spheres fit to the 
MRA image) on the left. Since it is expected that the fatter child vessel 
will transport the most blood, this should be a reasonable notion of 
dominant branch. 

• Descendant Correspondence: Put the node that corresponds to 
the child with the most descendants on the left. 
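
As a minimal sketch of how the two correspondences turn an anatomical tree into a binary tree, the Python fragment below orders the children of each branch by a chosen key and records the resulting positions as level-order indices (root 1, children of node w at 2w and 2w + 1, the indexing described in Subsection 2.2). The nested-dictionary layout, the field names `radius` and `children`, the function names, and the example vessel are all hypothetical and are not taken from the data set; each branch is assumed to have at most two children.

```python
def count_descendants(branch):
    """Number of branches strictly below this one."""
    return sum(1 + count_descendants(c) for c in branch["children"])


def to_index_set(branch, key, index=1):
    """Level-order indices of the binary tree rooted at `branch`.

    The child with the larger `key` value is placed on the left
    (index 2 * index), the other child on the right (2 * index + 1).
    """
    nodes = {index}
    ordered = sorted(branch["children"], key=key, reverse=True)
    for offset, child in enumerate(ordered):
        nodes |= to_index_set(child, key, 2 * index + offset)
    return nodes


def leaf(radius):
    """Hypothetical branch without children."""
    return {"radius": radius, "children": []}


# Hypothetical vessel: a thick child without descendants and a thin
# child that splits once more.
vessel = {
    "radius": 2.0,
    "children": [
        leaf(1.5),
        {"radius": 0.8, "children": [leaf(0.5), leaf(0.4)]},
    ],
}

thickness = to_index_set(vessel, key=lambda b: b["radius"])
descendant = to_index_set(vessel, key=count_descendants)
```

In this made-up example the two rules disagree: the thickness rule puts the thick childless branch on the left, giving the index set {1, 2, 3, 6, 7}, while the descendant rule puts the branching child on the left, giving {1, 2, 3, 4, 5}.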

These correspondences are compared in Subsection 2.3.

Fig 3. Green line segments show the topology only representation of the gold (back) tree
from Figure 2. Only branching information is retained for the OODA. Branch location
and thickness information is deliberately ignored. Thin blue curve shows the union over
all trees in the sample.

Other types of correspondence, that have not yet been studied, are also
possible. An attractive possibility, suggested in personal discussion by Marc
Niethammer, is to use location information of the children in this choice. E.g.
in the back tree, one could choose the child which is physically more on the 
left side (or perhaps the child whose descendants are more on average to the 
left) as the left node in this representation. This would give a representation 
that is physically closer to the actual data, which may be more natural for 
addressing certain types of anatomical issues. 

2.2. Tree-Lines. In this section we develop the tools of our main analysis, 
based on the notion of tree-lines. We follow the ideas of Wang and Marron 
(2007), who laid the foundations for this type of analysis, with a set of 
ideas for extending the Euclidean workhorse method of PCA to data sets 
of tree structured objects. The key idea (originally suggested in personal 
conversation by J. O. Ramsay) was to define an appropriate one dimensional 
representation, and then find the one that best fits the data. The tree-line 
is a first simple approach to this problem. 

First we define a binary tree: 

Definition 2.1. A binary tree is a set of nodes that are connected by
edges in a directed fashion, which starts with one node designated as root,
where each node has at most two children.

Using the notation $t_i$ for a single tree, we let

(2.1)    $T = \{t_1, \ldots, t_n\}$

denote a data set of n such trees. A toy example of a set of 3 trees is given
in Figure 4.

Fig 4. Toy example of a data set of trees, T, with n = 3. This will be used to illustrate
several issues below.

To identify the nodes within each tree more easily, we use the level-order
indexing method from Wang and Marron (2007). The root node has index
1. For the remaining nodes, if a node has index $\omega$, then the index of its left
child is $2\omega$ and of its right child is $2\omega + 1$. These indices enable us to identify
a binary tree by only listing the indices of its nodes.

The basis of our analysis is an appropriate metric, i.e. distance, on tree
space. We use the common notion of Hamming distance for this purpose:

Definition 2.2. Given two trees $t_1$ and $t_2$, their distance is

    $d(t_1, t_2) = |t_1 \setminus t_2| + |t_2 \setminus t_1|$,

where $\setminus$ denotes set difference.

Two more basic concepts are defined below; the notion of support tree
has already been shown in Figure 3 (as the thin blue lines).

Definition 2.3. For a data set T, given as in (2.1), the support tree
and the intersection tree are defined as

    $\mathrm{Supp}(T) = \bigcup_{i=1}^{n} t_i$,
    $\mathrm{Int}(T) = \bigcap_{i=1}^{n} t_i$.


Figure 7 shows the support trees of the data sets used in this study. Figure 
8 includes the corresponding intersection trees. 
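
Because level-order indices identify a tree by its node set, Definitions 2.2 and 2.3 reduce to elementary set operations. The following is a minimal Python sketch under that representation; the function names and the four-tree data set are hypothetical (this is not the toy data of Figure 4).

```python
def distance(t1, t2):
    """Hamming distance of Definition 2.2: nodes in exactly one of the trees."""
    return len(t1 - t2) + len(t2 - t1)


def support_tree(trees):
    """Union of all trees in the data set (Definition 2.3)."""
    return set().union(*trees)


def intersection_tree(trees):
    """Nodes shared by every tree in the data set (Definition 2.3)."""
    return set.intersection(*trees)


# Hypothetical data set; each tree is a set of level-order indices.
T = [
    {1, 2, 3, 6, 12},
    {1, 2, 3, 4, 6, 12},
    {1, 2, 3, 4, 6, 8},
    {1, 2, 4, 8},
]

d_first_last = distance(T[0], T[3])  # {3, 6, 12} versus {4, 8}
supp = support_tree(T)
inter = intersection_tree(T)
```

For this data set the first and last trees are at distance 5, the support tree is {1, 2, 3, 4, 6, 8, 12} and the intersection tree is {1, 2}.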
Fig 5. Toy example of a tree-line. Each member comes from adding a node to the previous one.
Each new node is a child of the previously added node. The starting point ($\ell_0$) is the
intersection tree of the toy data set of Figure 4.

The main idea of a tree-line (our notion of one dimensional representation) 
is that it is constructed by adding a sequence of single nodes, where each 
new node is a child of the most recent child: 

Definition 2.4. A tree-line, $L = \{\ell_0, \ldots, \ell_m\}$, is a sequence of trees
where $\ell_0$ is called the starting tree, and $\ell_i$ comes from $\ell_{i-1}$ by the addition
of a single node, labeled $v_i$. In addition each $v_{i+1}$ is a child of $v_i$.

An example of a tree-line is given in Figure 5. Insight as to how well a 
given tree-line fits a data set is based upon the concept of projection: 

Definition 2.5. Given a data tree t, its projection onto the tree-line
L is

    $P_L(t) = \arg\min_{\ell \in L} \{ d(t, \ell) \}$.

Wang and Marron (2007) show that this projection is always unique. This 
will also follow from Claim 3.1 in Section 3, whose characterization of the 
projection will be the key in computing the principal component tree-lines, 
defined shortly. 

The above toy examples provide an illustration. Let $t_3$ be the third tree
shown in Figure 4. Name the trees in the tree-line, L, shown in Figure 5, as
$\ell_0, \ell_1, \ell_2, \ell_3$. The set of distances from $t_3$ to each tree in L is tabulated
as

    j                   0  1  2  3
    $d(t_3, \ell_j)$    6  5  4  5

The minimum distance is 4, achieved at j = 2, so the projection of $t_3$ onto
the tree-line L is $\ell_2$.
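
The same kind of tabulation can be sketched in a few lines of Python by searching over the members of a tree-line directly, as in Definition 2.5. The tree-line and data tree below are hypothetical (they are not the trees of Figures 4 and 5), and the helper names are illustrative only.

```python
def distance(t1, t2):
    """Hamming distance of Definition 2.2."""
    return len(t1 - t2) + len(t2 - t1)


def tree_line(start, path):
    """Trees l_0, ..., l_m obtained by adding the path nodes v_1, ..., v_m in turn."""
    line = [frozenset(start)]
    for v in path:
        line.append(line[-1] | {v})
    return line


def project(t, line):
    """Member of the tree-line closest to t (Definition 2.5), by exhaustive search."""
    return min(line, key=lambda member: distance(t, member))


L = tree_line({1, 2}, [3, 6, 12])  # hypothetical tree-line
t = {1, 2, 3, 4, 6, 8}             # hypothetical data tree
distances = [distance(t, member) for member in L]
proj = project(t, L)
```

Here the distances to the four members are 4, 3, 2, 3, so the projection is the third member, {1, 2, 3, 6}. As in the tabulated example, the distance falls while the added nodes belong to the data tree and rises once they do not.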

Next we develop an analog of the first principal component (PC1), by
finding the tree-line that best fits the data.

Definition 2.6. For a data set T, the first principal component
tree-line, i.e. PC1, is

    $L_1^* = \arg\min_{L} \sum_{t_i \in T} d(t_i, P_L(t_i))$.

In conventional Euclidean PCA, additional components are restricted to 
lie in the subspace orthogonal to existing components, and subject to that
restriction, to fit the data as well as possible. For an analogous notion in
tree space, we first need to define the concept of the union of tree-lines, and 
of a projection onto it. 

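Definition 2.7. Given tree-lines $L_1 = \{\ell_{1,0}, \ell_{1,1}, \ldots, \ell_{1,p_1}\}, \ldots, L_q = \{\ell_{q,0}, \ell_{q,1}, \ldots, \ell_{q,p_q}\}$,
their union is the set of all possible unions of members
of $L_1$ through $L_q$:

    $L_1 \cup \cdots \cup L_q = \{ \ell_{1,i_1} \cup \cdots \cup \ell_{q,i_q} \mid i_1 \in \{0, \ldots, p_1\}, \ldots, i_q \in \{0, \ldots, p_q\} \}$.

Given a data tree t, the projection of t onto $L_1 \cup \cdots \cup L_q$ is

(2.2)    $P_{L_1 \cup \cdots \cup L_q}(t) = \arg\min_{\ell \in L_1 \cup \cdots \cup L_q} \{ d(t, \ell) \}$.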

In our non-Euclidean tree space, there is no notion of orthogonality available,
so we instead just ask that the 2nd tree-line fit as much of the data as
possible, when used in combination with the first, and so on.

Definition 2.8. For $k \geq 1$ the kth principal component tree-line is
defined recursively as

(2.3)    $L_k^* = \arg\min_{L} \sum_{t_i \in T} d(t_i, P_{L_1^* \cup \cdots \cup L_{k-1}^* \cup L}(t_i))$,

and it is abbreviated as PCk.

For the concept of PC tree-lines to be useful, it is of crucial importance 
to be able to compute them efficiently. We need another notion. 

Definition 2.9. Given a tree-line

    $L = \{\ell_0, \ell_1, \ldots, \ell_m\}$

we define the path of L as

    $V_L = \ell_m \setminus \ell_0$.

Fig 6. Weighted support tree illustrating Theorem 2.1.



Intuitively, a tree-line that fits the data well "should grow in the direction
that captures the most information". Furthermore, the kth PC tree-line
should only aim to capture information that has not been explained by the
first $k - 1$ PC tree-lines. This intuition is made precise in the following
theorem, which is the main theoretical result of the paper:

Theorem 2.1. Let $k \geq 1$, and let $L_1^*, \ldots, L_{k-1}^*$ be the first $k - 1$ PC
tree-lines. For $v \in \mathrm{Supp}(T)$ define

(2.4)    $w_k(v) = \begin{cases} 0, & \text{if } v \in V_{L_1^*} \cup \cdots \cup V_{L_{k-1}^*}, \\ \sum_{v \in t_i} 1, & \text{otherwise.} \end{cases}$

Then the kth PC tree-line $L_k^*$ is the tree-line whose path maximizes the sum
of $w_k$ weights in the support tree, i.e. $\sum_{v \in V_{L_k^*}} w_k(v)$.

The proof of Theorem 2.1 is given in Section 3. Figure 6 is an illustration:
the weight of a node is the number of times the node appears in the trees of
Figure 4. The black edge is the intersection tree of the same data set. The
maximum weight path attached to Int(T) is the red path, which gives rise
to the tree-line of Figure 5, which is thus the first principal component of
the data set of Figure 4.

After setting the weights of the nodes on the red path to zero, the maximum
weight path attached to Int(T) becomes the green path, which by
Theorem 2.1 gives rise to PC2. The usefulness of these tools is demonstrated
with actual data analysis of the full tree data set.
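
The procedure suggested by Theorem 2.1 (weight every support node by the number of data trees containing it, find the heaviest downward path attached to the intersection tree, set the weights on that path to zero, repeat) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name, the tie-breaking rule (smaller index wins), the choice to stop a path once no further weight can be gained, and the data set are assumptions. Sorting is used for brevity, so each pass costs O(m log m) in the number m of support nodes rather than strictly linear time. Trees are assumed to be sets of level-order indices that contain the parent of each of their non-root nodes.

```python
from collections import Counter


def pc_tree_line_paths(trees, k):
    """Paths of the first k PC tree-lines, each a list of level-order indices."""
    support = set().union(*trees)
    start = set.intersection(*trees)       # starting tree l_0 = Int(T)
    count = Counter(v for t in trees for v in t)
    used, paths = set(), []
    for _ in range(k):
        best, nxt = {}, {}
        # A larger index means a deeper node, so children come before parents.
        for v in sorted(support - start, reverse=True):
            kids = [c for c in (2 * v, 2 * v + 1) if c in support]
            c = max(kids, key=lambda u: (best[u], -u), default=None)
            gain = best[c] if c is not None else 0
            w = 0 if v in used else count[v]  # weight w_k(v) as in (2.4)
            best[v] = w + gain                # heaviest path starting at v
            nxt[v] = c if gain > 0 else None  # stop when nothing is gained
        # A path is attached to l_0 when the parent of its first node is in l_0.
        heads = [v for v in best if v // 2 in start]
        v = max(heads, key=lambda u: (best[u], -u), default=None)
        path = []
        while v is not None and best[v] > 0:
            path.append(v)
            v = nxt[v]
        used.update(path)
        paths.append(path)
    return paths


# Hypothetical data set; each tree is a set of level-order indices.
T = [
    {1, 2, 3, 6, 12},
    {1, 2, 3, 4, 6, 12},
    {1, 2, 3, 4, 6, 8},
    {1, 2, 4, 8},
]
paths = pc_tree_line_paths(T, 3)
```

For this data set the first path is [3, 6, 12] (weights 3 + 3 + 2), the second is [4, 8] (weights 3 + 2), and the third is empty because every remaining weight is zero. Zeroing weights rather than deleting nodes matters: a later path may have to pass through already used nodes to reach unexplained ones further down.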
2.3. Real Data Results. This section describes an exploratory data analysis
of the set of n = 73 brain trees discussed above using these tree-line
ideas. The principal component tree-lines are computed as described in
Theorem 2.1. Both correspondence types, defined in Section 2.1, are considered
and compared.

The different brain location types (shown as different colors in Figure 
2) are analyzed as separate populations (i.e. the n = 73 blue trees are 
first considered to be the population, then the n = 73 gold trees, etc.), 
called brain location sub-populations. This reveals some interesting contrasts 
between the brain location types in terms of symmetry. 



Fig 7. Support trees, for both types of correspondence (shown in the rows: Thickness
Correspondence, Descendant Correspondence), and for three brain location tree types
(shown in columns: Left, Back, Right, corresponding to the colors in Figure 2). Shows
that the descendant correspondence gives a population with more compact variation than
the thickness correspondence.



We first compare the two types of correspondence defined in Section 2.1
using the concept of the support tree. This is done by displaying the support
trees for each type of correspondence, and for each of the three tree location
types (shown with different colors in Figure 2), in Figure 7. Note that all
of the support trees for the descendant correspondence (bottom) are much 
smaller than for the thickness correspondence (top), indicating that the de- 
scendant correspondence results in a much more compact population. This 
seems likely to make it easier for our PCA method to find an effective rep- 
resentation of the descendant based population. 

Figure 7 already reveals an aspect of the population that was previously 
unknown: there is not a very strong correlation between median tree thick- 
ness of a branch, and the number of children. 

Figure 8 shows the first 3 PC tree-lines, for the three sub-populations 
(shown as rows), with the intersection tree as the starting tree, for the 
descendant correspondence. 

In the human brain, the back circulation (gold) arises from a single vessel 
(the basilar artery) and immediately splits into two main trunks, supplying 
the back sides of the left and right hemispheres. These two parts of the back 
circulation are expected to be approximately mirror-image symmetrical with 
both sides containing one main vessel and other branches stemming from 
that. Consequently, for each tree in the back data set, if we imagine a vertical
axis that goes through the root node, we expect the subtrees on both sides
of the axis to be symmetrical with each other.

The results of our model for the back subpopulation are consistent with 
this expectation. The main vessel of one of the hemispheres can be seen in 
the starting point (intersection tree) as the leftmost set of nodes, while the 
other main vessel becomes the first principal component. 

As for the left and right circulations (cyan and blue trees) of the brain, 
they are expected to be close to mirror images of each other. Unlike the case 
of the back subpopulation, in each of these circulations there is a single trunk 
from which smaller branches stem. For this reason the bilateral symmetry 
observed within the back trees is not expected to be found here. 

The fact that the PC1's for the left and right subpopulations are at later splits
suggests that the earlier splits tend to have relatively few descendants. The
remaining PC2 and PC3 tree-lines do not contain much additional information
by themselves. However, when we consider PC's 1, 2 and 3 together
and compare left and right subpopulations, i.e. compare the second and third
rows of Figure 8, the structural similarity is quite visible. It should also be
noted that for both of the subpopulations all PC's are on the left side of the
root-axis, indicating a strong bilateral asymmetry, as expected.

The tree-lines, and insights obtained from them, were essentially similar 
for the thickness correspondence, so those graphics are not shown here. 
Fig 8. Best fitting tree-lines, for different sub-populations (rows: Back, Left, Right), and
PC number (columns: PC1, PC2, PC3). Intersection trees are shown in black.



Next we study the tree-line analog of the familiar scores plot from con- 
ventional PCA (a commonly used high dimensional visualization device, 
sometimes called a draftsman's plot or a scatterplot matrix). In that case, 
the scores are the projection coefficients, which indicate the size of the com- 
ponent of each data point in the given eigen-direction. Pairwise scatterplots 
of these often give a set of useful two dimensional views of the data. In the 
present case, given a data point and a tree-line, the corresponding score is 
just the length (i.e. the number of nodes) of the projection. Unlike conven- 
tional PC scores, these are all integer valued. 

Figure 9 shows the scores scatterplot for the set of left trees, based on the 
descendant correspondence. The data points have been colored in Figure 9, 
to indicate age, which is an important covariate, as discussed in Bullitt et 
al (2008). The color scheme starts with purple for the youngest person (age 
20) and extends through a rainbow type spectrum (blue-cyan-green-yellow-orange)
to red for the oldest (age 72). An additional covariate, of possible
interest, is sex, with females shown as circles, males as plus signs, and two
transgender cases indicated using asterisks.

Fig 9. Scores Scatterplot for the Descendant Correspondence, Left Side sub-population.
Colors show age, symbols gender. No clear visual patterns are apparent.

It was hoped that this visualization would reveal some interesting structure
with respect to age (color), but it is not easy to see any such connection
in Figure 9. One reason for this is that the tree-lines only allow the very limited
range of scores as integers in the range 1-10. A simple way to generate a
wider range of scores is to project not just onto simple tree-lines, but instead
onto their union, as defined in (2.2). Figure 10 shows a scatterplot matrix
of several union PC scores, in particular PC1 vs. PC1 ∪ 2 (shorthand for
PC1 ∪ PC2) vs. PC1 ∪ 2 ∪ 3 vs. PC1 ∪ 2 ∪ 3 ∪ 4. This combined plot, called
the cumulative scores scatterplot, shows a better separation of the data than
is available in Figure 9. The PC unions show a banded structure, which
again is an artifact that follows from each PC score individually having a
very limited range of possible values. This seems to be a serious limitation
of the tree-line approach to analyzing population structure.
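
A minimal sketch of how such scores could be computed is given below. It relies on the closed forms proved in Section 3 (Claims 3.1 and 3.2), by which the projection onto a tree-line, or onto a union of tree-lines with a common starting tree, is the starting tree together with those path nodes that belong to the data tree. Counting every node of the projection, including the nodes of the starting tree, is an assumed reading of "length"; the function names, data set and paths are hypothetical.

```python
def projection(t, start, path_nodes):
    """Starting tree united with the path nodes that occur in t."""
    return set(start) | (set(t) & set(path_nodes))


def cumulative_scores(t, start, paths):
    """Scores of t for PC1, PC1 ∪ 2, ...: node counts of the union projections."""
    scores, nodes = [], set()
    for path in paths:
        nodes |= set(path)
        scores.append(len(projection(t, start, nodes)))
    return scores


# Hypothetical data set, starting tree and PC paths.
T = [
    {1, 2, 3, 6, 12},
    {1, 2, 3, 4, 6, 12},
    {1, 2, 3, 4, 6, 8},
    {1, 2, 4, 8},
]
start = {1, 2}
pc_paths = [[3, 6, 12], [4, 8]]

pc2_scores = [len(projection(t, start, pc_paths[1])) for t in T]
table = [cumulative_scores(t, start, pc_paths) for t in T]
```

Each row of `table` holds the PC1 score and the PC1 ∪ 2 score of one tree. The individual PC2 scores take only the values 2, 3 and 4 in this example, while the cumulative scores spread over a slightly wider range.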
As with Figure 9, there is unfortunately no readily apparent visual connection
between age and the visible population structure. However, visual
impressions of this type can be tricky, and in particular it can be hard to see
some subtle effects.

Fig 10. Cumulative Scores Scatterplot for the Descendant Correspondence, Left Side
sub-population.



Figure 11 shows a view that more deeply scrutinizes the dependence of
the PC1 score on age, using a scatterplot, overlaid with the least squares
regression fit line. Note that most of the lines slope downwards, suggesting
that older people tend to have a smaller PC1 projection than younger people.
Statistical significance of this downward slope is tested by calculating
the standard linear regression p-value for the null hypothesis of zero slope. For
the left tree, using the descendant correspondence, the p-value is 0.0025.
This result is strongly significant, indicating that this component is connected
with age. This is consistent with the results of Bullitt et al (2008),
who noted a decreasing trend with age in the total number of nodes. Our
result is the first location specific version of this.
Similar score versus age plots have been made, and hypothesis tests have
been run, for other PC components, and the resulting p-values, for the left
tree using the descendant correspondence are summarized in this table:

Fig 11. Scatterplot of PC1 score versus age. Least squares fit regression line suggests a
downward trend in age. Trend is confirmed by the p-value of 0.003 (for significance of slope
of the line).



Note that for the individual PCs, only PC1 gives a statistically significant
result. For the cumulative PCs, all are significant, but the significance
diminishes as more components are added. This suggests that it is really
PC1 which is the driver of all of these results.

To interpret these results, recall from Figure 8 that for the left trees, PC1
chooses the left child for the first 3 splits, and the right child at the 4th split.
This suggests that there is not a significant difference between the ages in
the tree levels closer to the root; however, the difference does show up when
one looks at the deeper tree structure, in particular after the 4th split. This 
is consistent with the above remark, that for the left brain sub-population, 
the first few splits did not seem to contain relevant population information. 
Instead the effects of age only appear on splits after level 4. 

We did a similar analysis of the back and right brain location sub-populations, 
but none of these found significant results, so they are not shown here. How- 
ever, these can be found at the web site (n). 

We also considered parallel results for the thickness correspondence, which 
again did not yield significant results (but these are on the web site (^)). 
The fact that descendant correspondence gave some significant results, while 
thickness never did, is one more indication that descendant correspondence 
is preferred. 

One more approach to the issue of correspondence choice is shown in 
Figure 12. This shows the amount of variation explained, as a function of the 
order of the Cumulative Union PC, for both the thickness and the descendant 
correspondences, for the left brain location sub-population. The amount 
of variation explained is defined to be the sum, over all trees in the sub-population,
of the lengths of the projections. There are 5023 nodes in total for
both correspondences. (The choice of correspondence affects the locations
of nodes, but the total count remains the same.)

It is not surprising that these curves are concave, since the first PC is
designed to explain the most variation, with each succeeding component
explaining a little bit less. But the important lesson from Figure 12 is that
the descendant correspondence allows PCA to explain much more population
structure, at each step, than the thickness correspondence.

In summary, there are several important consequences of this work: 

• In real data sets with branching structure, tree PCA can reveal inter- 
esting insights, such as symmetry. 

• The descendant correspondence is clearly superior to the thickness 
correspondence, and is recommended as the default choice in future 
studies. 

• As expected, the back sub-population is seen to have a more symmetric 
structure. 

• For the left sub-population there is a statistically significant structural 
age effect. 




• There seems to be room for improvement of the tree-line idea for doing
PCA on populations of trees. A possible improvement is to allow a
richer branching structure, such as adding the next node as a child of
one of the last 2 or 3 nodes. We are exploring this methodology in our
current research.

Fig 12. Total number of nodes explained, as a function of Cumulative PC Number. Shows
that the descendant correspondence allows PCA to explain a much higher proportion of
the variation in the population than the thickness correspondence.

3. Optimization proofs. This section is devoted to the proof of The- 
orem 2.1 with some accompanying claims. 

Claim 3.1. Let $L = \{\ell_0, \ldots, \ell_m\}$ be a tree-line, and t a data tree. Then

(3.1)    $P_L(t) = \ell_0 \cup (t \cap V_L)$.

Proof: Since $\ell_i = \ell_{i-1} \cup \{v_i\}$, we have

(3.2)    $d(t, \ell_i) = \begin{cases} d(t, \ell_{i-1}) - 1, & \text{if } v_i \in t; \\ d(t, \ell_{i-1}) + 1, & \text{otherwise.} \end{cases}$

In other words, the distance of the tree to the line decreases as we keep
adding nodes of $V_L$ that are in t, and when we step out of t, the distance
begins to increase, so Claim 3.1 follows.

□

Claim 3.2. Let $L_1, \ldots, L_q$ be tree-lines with a common starting point,
and t a data tree. Then

    $P_{L_1 \cup \cdots \cup L_q}(t) = P_{L_1}(t) \cup \cdots \cup P_{L_q}(t)$.

Proof: For simplicity, we only prove the statement for q = 2. Assume that

    $L_1 = \{\ell_{1,0}, \ell_{1,1}, \ldots, \ell_{1,p_1}\}$,
    $L_2 = \{\ell_{2,0}, \ell_{2,1}, \ldots, \ell_{2,p_2}\}$

with $\ell_0 = \ell_{1,0} = \ell_{2,0}$, and

(3.3)    $V_{L_1} = \{v_{1,1}, \ldots, v_{1,p_1}\}$,  $V_{L_2} = \{v_{2,1}, \ldots, v_{2,p_2}\}$.

Also assume

(3.4)    $P_{L_1}(t) = \ell_{1,r_1}$,

(3.5)    $P_{L_2}(t) = \ell_{2,r_2}$.

For brevity, let us define

(3.6)    $f(i, j) = d(t, \ell_{1,i} \cup \ell_{2,j})$ for $1 \leq i \leq p_1$, $1 \leq j \leq p_2$.

Using Claim 3.1, (3.4) means

(3.7)    $v_{1,i} \in t$ if $i \leq r_1$, and $v_{1,i} \notin t$ if $i > r_1$,

hence

(3.8)    $f(i, j) \leq f(i-1, j)$ if $i \leq r_1$;
         $f(i, j) \geq f(i-1, j)$ if $i > r_1$.

By symmetry, we have

(3.9)    $f(i, j) \leq f(i, j-1)$ if $j \leq r_2$;
         $f(i, j) \geq f(i, j-1)$ if $j > r_2$.

Overall, (3.8) and (3.9) imply that the function f attains its minimum at
$i = r_1$, $j = r_2$, which is what we had to prove. □
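
Claims 3.1 and 3.2 can also be checked numerically on small random instances. The Python sketch below compares an exhaustive search over all members of a tree-line, and over all members of a union of two tree-lines, against the closed forms. The random generators, the fixed seed, the sizes and the common starting tree {1} are arbitrary choices made for illustration; a passing run is a sanity check, not a substitute for the proofs.

```python
import random
from itertools import product


def distance(t1, t2):
    """Hamming distance of Definition 2.2."""
    return len(t1 - t2) + len(t2 - t1)


def random_tree(rng, size):
    """Grow a random binary tree (set of level-order indices) from the root."""
    tree = {1}
    while len(tree) < size:
        parent = rng.choice(sorted(tree))
        tree.add(2 * parent + rng.randint(0, 1))
    return tree


def random_tree_line(rng, length):
    """Tree-line with starting tree {1}; each new node is a child of the last."""
    line, v = [frozenset({1})], 1
    for _ in range(length):
        v = 2 * v + rng.randint(0, 1)
        line.append(line[-1] | {v})
    return line


def brute_force_projection(t, candidates):
    """Closest candidate to t by exhaustive search."""
    return min(candidates, key=lambda member: distance(t, member))


rng = random.Random(0)
checked = 0
for _ in range(200):
    t = random_tree(rng, rng.randint(1, 15))
    L1 = random_tree_line(rng, 4)
    L2 = random_tree_line(rng, 4)
    for L in (L1, L2):
        path = L[-1] - L[0]  # V_L of Definition 2.9
        # Claim 3.1: projection equals l_0 united with (t intersected with V_L).
        assert brute_force_projection(t, L) == L[0] | (t & path)
    union = [a | b for a, b in product(L1, L2)]  # Definition 2.7
    # Claim 3.2: projection onto the union equals the union of projections.
    assert brute_force_projection(t, union) == (
        brute_force_projection(t, L1) | brute_force_projection(t, L2)
    )
    checked += 1
```

The check relies on every data tree containing the common starting tree and the parent of each of its nodes, which is what makes the nodes of a path that belong to t form an initial segment of that path.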
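Claim 3.3. Let S be a subset of Supp(T) which contains $\ell_0$. For $v \in \mathrm{Supp}(T)$
define

(3.10)    $w_S(v) = \begin{cases} 0, & \text{if } v \in S, \\ \sum_{v \in t_i} 1, & \text{otherwise.} \end{cases}$

Then among the tree-lines with starting tree $\ell_0$ the one which maximizes

    $\sum_{t_i \in T} |(V_L \cup S) \cap t_i|$

is the one whose path $V_L$ maximizes the sum of the $w_S$ weights: $\sum_{v \in V_L} w_S(v)$.

Proof: For $v \in \mathrm{Supp}(T)$, and a subtree t of Supp(T), let us define

(3.11)    $\delta(v, t) = \begin{cases} 1, & \text{if } v \in t, \\ 0, & \text{otherwise.} \end{cases}$

Then

    $\arg\max_L \sum_{t_i \in T} |(V_L \cup S) \cap t_i| = \arg\max_L \sum_{t_i \in T} \sum_{v \in V_L \cup S} \delta(v, t_i)$
    $\qquad = \arg\max_L \sum_{v \in V_L \cup S} \sum_{t_i \in T} \delta(v, t_i)$
    $\qquad = \arg\max_L \sum_{v \in V_L} w_S(v)$.

□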

Finally, we prove our main result:
Proof of Theorem 2.1: For better intuition, we first give a proof when
k = 1. Using Claim 3.1 in Definition 2.6, we get

    $L_1^* = \arg\min_L \sum_{t_i \in T} d(t_i, \ell_0 \cup (t_i \cap V_L))$.

Since $V_L$ is disjoint from $\ell_0$,

    $L_1^* = \arg\max_L \sum_{t_i \in T} |V_L \cap t_i|$,

and the statement follows from Claim 3.3 with $S = \emptyset$.
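
The passage from the argmin to the argmax in the case k = 1 can be spelled out as follows. This is a short sketch that uses only Definition 2.2 and the fact that $V_L$ is disjoint from $\ell_0$. For any data tree $t_i$:

```latex
\begin{align*}
d\bigl(t_i,\ \ell_0 \cup (t_i \cap V_L)\bigr)
  &= |t_i \setminus (\ell_0 \cup V_L)| + |\ell_0 \setminus t_i| \\
  &= |t_i \setminus \ell_0| - |t_i \cap V_L| + |\ell_0 \setminus t_i| \\
  &= d(t_i, \ell_0) - |V_L \cap t_i| .
\end{align*}
```

Since $d(t_i, \ell_0)$ does not depend on L, minimizing the sum of these distances over L is the same as maximizing $\sum_{t_i \in T} |V_L \cap t_i|$.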
We now prove the statement for general k. For an arbitrary data tree t,
and tree-line L, we have

(3.12)    $P_{L_1^* \cup \cdots \cup L_{k-1}^* \cup L}(t) = P_{L_1^*}(t) \cup \cdots \cup P_{L_{k-1}^*}(t) \cup P_L(t)$
          $\qquad = \ell_0 \cup (V_{L_1^*} \cap t) \cup \cdots \cup (V_{L_{k-1}^*} \cap t) \cup (V_L \cap t)$
          $\qquad = \ell_0 \cup [(V_{L_1^*} \cup \cdots \cup V_{L_{k-1}^*} \cup V_L) \cap t]$,

with the first equation from Claim 3.2, the second from Claim 3.1, and the
third straightforward.

Combining (3.12) with (2.3) we get

(3.13)    $L_k^* = \arg\min_L \sum_{t_i \in T} d(t_i, \ell_0 \cup [(V_{L_1^*} \cup \cdots \cup V_{L_{k-1}^*} \cup V_L) \cap t_i])$.

Again, the paths of $L_1^*, \ldots, L_{k-1}^*$ and L are disjoint from $\ell_0$, so (3.13) becomes

(3.14)    $L_k^* = \arg\max_L \sum_{t_i \in T} |(V_{L_1^*} \cup \cdots \cup V_{L_{k-1}^*} \cup V_L) \cap t_i|$,

so the statement follows from Claim 3.3 with $S = V_{L_1^*} \cup \cdots \cup V_{L_{k-1}^*}$. □